2025-06-02-12-08
Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve
Abstract
arXiv:2505.23946v1
Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their performance varies at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no single model dominates the others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation-banking-selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.
摘要
近期研究表明,大型语言模型(LLMs)具备不同技能并擅长不同任务。事实上,我们观察到其性能差异存在于多个粒度层级。例如在代码优化任务中,代码LLMs在不同优化类别上各有所长,没有单一模型能全面占优。这一现象引发了一个问题:如何在未知模型互补优势的前提下,利用多个LLM智能体协同解决编码问题。我们认为,智能体团队可以通过相互学习成功与失败经验来提升个体性能。因此,我们将"经验"定义为智能体在集体求解过程中产生并传递给其他成员的知识。本研究提出基于经验协作的框架,设计经验征集-存储-选择机制,并证明通过经验共享的小型LLM团队,其性能可超越单个大型LLM及其他多LLM协作方法。
EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving
Abstract
arXiv:2505.23970v1
As large language models (LLMs) become widely used, their environmental impact, especially carbon emissions, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present EmbAdvisor, a carbon-aware caching framework that selects the optimal cache size for LLM serving. EmbAdvisor profiles different LLM tasks and uses an Integer Linear Programming (ILP) solver to select cache sizes that meet SLOs while minimizing total carbon emissions. Overall, EmbAdvisor reduces the average carbon emissions of a Llama-3 70B model by 9.5% under various carbon intensities compared to a non-adaptive cache scenario, and can save up to 31.2% when the carbon intensity is low.
摘要
随着大语言模型(LLMs)的广泛应用,其环境影响——尤其是碳排放问题——日益受到关注。现有研究主要关注计算相关的碳排放,本文发现存储同样是关键因素。LLM缓存技术通过保存并复用重复上下文的KV缓存来避免冗余计算,从而降低运行碳排放,但这一优势需以高容量高速固态硬盘的隐含碳排放为代价。随着LLM规模扩大,存储设备的隐含碳排放显著增长。为平衡这一矛盾,我们提出EmbAdvisor——一个碳感知缓存框架,可为LLM服务选择最优缓存规模。该框架通过分析不同LLM任务特征,运用整数线性规划(ILP)求解器在满足服务等级目标(SLO)的同时最小化总碳排放。实验表明,相较于非自适应缓存方案,EmbAdvisor能使Llama-3 70B模型在各种碳强度下的平均碳排放降低9.5%,在低碳强度场景下最高可减少31.2%的碳排放。
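The trade-off in the EmbAdvisor abstract can be illustrated with a toy selection rule: for each candidate cache size, add the operational carbon of serving to the amortized embodied carbon of the SSD capacity, then keep the cheapest option that still meets the latency SLO. This is only a sketch under stated assumptions. The paper uses an ILP solver over profiled tasks, whereas the code below substitutes an exhaustive search over a handful of candidates, and every name and number in it (`CacheOption`, `pick_cache_size`, the profile values) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheOption:
    """One profiled cache configuration (all values are made up)."""
    size_tb: float      # SSD capacity devoted to KV caches
    energy_kwh: float   # serving energy per hour at this cache size
    latency_ms: float   # observed latency at this cache size


def total_carbon(opt, carbon_intensity, embodied_per_tb_hour):
    """Operational plus amortized embodied carbon, in gCO2e per hour."""
    operational = opt.energy_kwh * carbon_intensity
    embodied = opt.size_tb * embodied_per_tb_hour
    return operational + embodied


def pick_cache_size(options, carbon_intensity, embodied_per_tb_hour, slo_ms):
    """Return the SLO-feasible option with the lowest total carbon."""
    feasible = [o for o in options if o.latency_ms <= slo_ms]
    if not feasible:
        raise ValueError("no cache size meets the SLO")
    return min(
        feasible,
        key=lambda o: total_carbon(o, carbon_intensity, embodied_per_tb_hour),
    )


# Hypothetical profile: a larger cache saves energy but costs more SSD.
OPTIONS = [
    CacheOption(size_tb=0.0, energy_kwh=10.0, latency_ms=900.0),
    CacheOption(size_tb=1.0, energy_kwh=8.0, latency_ms=600.0),
    CacheOption(size_tb=4.0, energy_kwh=7.0, latency_ms=500.0),
]

if __name__ == "__main__":
    for intensity in (500.0, 50.0):  # gCO2e per kWh
        best = pick_cache_size(OPTIONS, intensity, 100.0, slo_ms=700.0)
        print(intensity, best.size_tb)
```

With these invented numbers the large cache wins on a carbon-intensive grid and the small cache wins on a clean grid, which matches the abstract's remark that the savings are largest when carbon intensity is low.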
SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
Abstract
arXiv:2505.24095v1
Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, like reserved instances or on-premise clusters, which are often underutilized due to their region-local traffic handling and diurnal traffic variance. In this paper, we introduce SkyLB, a locality-aware multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. By doing so, SkyLB enables providers to reserve instances based on expected global demand, rather than peak demand in each individual region. Meanwhile, SkyLB preserves KV-Cache locality and a balanced load, ensuring cost efficiency without sacrificing performance. SkyLB achieves this with a cache-aware cross-region traffic handler and a selective pushing load balancing mechanism based on checking pending requests. Our evaluation on real-world workloads shows that it achieves 1.12-2.06x higher throughput and 1.74-6.30x lower latency compared to existing load balancers, while reducing total serving cost by 25%.
摘要
在多区域部署中高效服务大型语言模型(LLM)仍面临挑战。出于成本和GPU可用性考虑,提供商通常使用长期承诺实例(如预留实例或本地集群)在多区域部署LLM,这些实例由于仅处理区域本地流量和昼夜流量波动而经常利用率不足。本文提出SkyLB——一种面向LLM推理的感知局部性多区域负载均衡器,通过跨区域流量处理聚合区域昼夜模式。这使得提供商可以根据预期全球需求而非单个区域峰值需求来预留实例。同时,SkyLB保持KV缓存局部性和均衡负载,在保证性能前提下实现成本效益。其核心技术包括缓存感知的跨区域流量处理器和基于待处理请求检查的选择性推送负载均衡机制。实际工作负载评估表明,相比现有负载均衡器,SkyLB实现了1.12-2.06倍的吞吐量提升和1.74-6.30倍的延迟降低,同时将总服务成本降低25%。
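To make the two ingredients named in the SkyLB abstract concrete (cache awareness and a pending-request check before pushing work to a replica), here is a toy routing rule. It is not SkyLB's actual policy. The `Replica` type, the `route` function, the `max_pending` threshold, and the ranking order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical model replica in some region."""
    name: str
    region: str
    pending: int = 0
    cached_prefixes: set = field(default_factory=set)


def route(prefix, home_region, replicas, max_pending=4):
    """Pick a replica for a request identified by its prompt prefix.

    Replicas at or above max_pending are skipped (selective pushing).
    Among the rest, prefer a KV-cache hit, then the caller's own region,
    then the lightest load. Returns None if every replica is saturated.
    """
    available = [r for r in replicas if r.pending < max_pending]
    if not available:
        return None  # the caller would hold the request in a queue

    def rank(r):
        # Tuples compare element by element; False sorts before True.
        return (prefix not in r.cached_prefixes, r.region != home_region, r.pending)

    chosen = min(available, key=rank)
    chosen.pending += 1
    chosen.cached_prefixes.add(prefix)
    return chosen


if __name__ == "__main__":
    fleet = [
        Replica("us-1", "us", pending=4, cached_prefixes={"p"}),
        Replica("us-2", "us", pending=1),
        Replica("eu-1", "eu", pending=0, cached_prefixes={"p"}),
    ]
    print(route("p", "us", fleet).name)  # saturated us-1 is skipped
```

In this sketch a request whose prefix is cached only on a saturated local replica is sent to a remote replica that holds the same prefix, while a request with no cached prefix stays in its home region.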
MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge
Abstract
arXiv:2505.23982v1
Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA represents the first benchmark to jointly evaluate the factual and reasoning capabilities that LLMs need in advanced materials science.
摘要
尽管大规模语言模型(LLMs)在材料科学领域取得了最新进展,但目前仍缺乏评估其领域专业知识和复杂推理能力的基准测试。为填补这一空白,我们提出了MSQA——一个包含1,757道研究生级别材料科学问题的综合评估基准,提供详细解释性回答和二元真/假判断两种形式。MSQA通过要求模型在七个材料科学子领域(如结构-性能关系、合成工艺和计算建模等)同时具备精确的事实知识和多步推理能力,对LLMs形成了独特挑战。通过对10个最先进LLMs的实验测试,我们发现当前模型性能存在显著差距:基于API的专有LLMs最高达到84.5%准确率,开源(OSS)LLMs峰值约为60.5%,而领域专用LLMs因过拟合和分布偏移问题表现普遍欠佳。MSQA是首个能联合评估LLMs事实掌握与推理能力的基准测试,这两项能力对先进材料科学领域的LLMs应用至关重要。
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Abstract
arXiv:2505.24073v1
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. In this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: modality configurations and retrieval strategies, (2) the re-ranking stage: strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: how best to integrate retrieved candidates into the final generation process. Finally, we explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
摘要
大型视觉语言模型(LVLMs)在视觉问答、视觉定位和复杂推理等多模态任务中取得了显著进展。然而,它们仍受限于静态训练数据、易产生幻觉以及无法根据最新外部证据验证主张等问题,这影响了其在动态现实应用中的表现。检索增强生成(RAG)通过让LVLMs借助检索机制访问大规模知识库,将模型输出基于事实性、上下文相关的信息,为缓解这些挑战提供了实用解决方案。本文首次系统剖析了面向LVLMs的多模态RAG流程,具体研究:(1)检索阶段:探讨模态配置与检索策略;(2)重排序阶段:研究减轻位置偏差和提高检索证据相关性的策略;(3)生成阶段:深入分析如何最优整合检索候选集至最终生成过程。最后,我们进一步探索通过自反思整合重排序与生成的统一代理框架,使LVLMs能动态选择相关证据并抑制无关上下文。针对LVLMs的RAG全栈研究获得了重要洞见,在无需微调的情况下平均性能提升达5%。
Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution
Abstract
arXiv:2505.24037v1
Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce model size, they struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods, such as full fine-tuning and LoRA, fail to preserve sparsity because they require updating the whole dense matrices, which makes them ill-suited to sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a novel method designed specifically for sparse LLMs. SEFT dynamically evolves the sparse topology of pruned models during fine-tuning, while preserving the overall sparsity throughout the process. The strengths of SEFT lie in its ability to perform task-specific adaptation through a weight drop-and-grow strategy, enabling the pruned model to self-adapt its sparse connectivity pattern based on the target dataset. Furthermore, a sensitivity-driven pruning criterion is employed to ensure that the desired sparsity level is consistently maintained throughout fine-tuning. Our experiments on various LLMs, including LLaMA families, DeepSeek, and Mistral, across a diverse set of benchmarks demonstrate that SEFT achieves stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.
摘要
大型语言模型(LLMs)虽在各种任务中取得显著成功,但其庞大的计算需求导致部署面临挑战。尽管SparseGPT和Wanda等训练后剪枝方法能有效缩减模型规模,但在高稀疏度下难以保持模型性能,限制了其在下游任务中的应用。现有微调方法(如全参数微调和LoRA)由于需更新整个稠密矩阵而无法保持稀疏性,并不适用于稀疏LLMs。本文提出稀疏性演化微调(SEFT),这是一种专为稀疏LLMs设计的新方法。SEFT在微调过程中动态演化剪枝模型的稀疏拓扑结构,同时全程保持整体稀疏度。其优势在于通过权重丢弃-生长策略实现任务自适应,使剪枝模型能根据目标数据集自我调整稀疏连接模式。此外,采用敏感度驱动的剪枝准则确保微调过程中始终维持目标稀疏度。我们在LLaMA系列、DeepSeek和Mistral等多种LLMs上的多基准测试表明,SEFT在保持更优内存和时间效率的同时,实现了比现有基线更强的性能。代码已开源:https://github.com/QiaoXiao7282/SEFT。
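A weight drop-and-grow step of the kind the SEFT abstract mentions can be sketched on a flat weight vector. The criteria below are assumptions borrowed from generic dynamic sparse training, not the paper's sensitivity-driven criterion: the sketch drops the active weights with the smallest magnitude and regrows the inactive positions with the largest gradient magnitude. The function name and arguments are hypothetical.

```python
def drop_and_grow(weights, mask, grads, k):
    """Swap k active connections for k inactive ones, keeping sparsity fixed.

    weights, grads: lists of floats; mask: list of bools (True = active).
    Drop rule (assumed): smallest |weight| among active positions.
    Grow rule (assumed): largest |gradient| among inactive positions.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(k, len(active), len(inactive))

    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_weights, new_mask = list(weights), list(mask)
    for i in drop:
        new_mask[i] = False
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = True
        new_weights[i] = 0.0  # regrown connections start from zero
    return new_weights, new_mask


if __name__ == "__main__":
    w = [0.9, 0.0, -0.05, 0.0, 0.4, 0.0]
    m = [True, False, True, False, True, False]
    g = [0.1, 0.8, 0.2, 0.05, 0.3, 0.6]
    print(drop_and_grow(w, m, g, k=1))
```

Because exactly as many positions are activated as are deactivated, the number of non-zero slots is the same before and after the step, which is the property the abstract describes as preserving overall sparsity.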
Using Reasoning Models to Generate Search Heuristics that Solve Open Instances of Combinatorial Design Problems
Abstract
arXiv:2505.23881v1
Large Language Models (LLMs) with reasoning are trained to iteratively generate and refine their answers before finalizing them, which can help with applications to mathematics and code generation. We apply code generation with reasoning LLMs to a specific task in the mathematical field of combinatorial design. This field studies diverse types of combinatorial designs, many of which have lists of open instances for which existence has not yet been determined. The Constructive Protocol CPro1 uses LLMs to generate search heuristics that have the potential to construct solutions to small open instances. Starting with a textual definition and a validity verifier for a particular type of design, CPro1 guides LLMs to select and implement strategies, while providing automated hyperparameter tuning and execution feedback. CPro1 with reasoning LLMs successfully solves long-standing open instances for 7 of 16 combinatorial design problems selected from the 2006 Handbook of Combinatorial Designs, including new solved instances for 3 of these (Bhaskar Rao Designs, Symmetric Weighing Matrices, Balanced Ternary Designs) that were unsolved by CPro1 with non-reasoning LLMs. It also solves open instances for several problems from recent (2025) literature, generating new Covering Sequences, Johnson Clique Covers, Deletion Codes, and a Uniform Nested Steiner Quadruple System.
摘要
具备推理能力的大语言模型(LLMs)经过训练,可在最终确定答案前迭代生成并优化结果,这有助于数学及代码生成领域的应用。本研究将基于推理LLMs的代码生成技术应用于组合设计数学领域的特定任务。该领域研究多种类型的组合设计,其中许多存在尚未确定存在性的开放实例列表。构造协议CPro1利用LLMs生成搜索启发式方法,这些方法有望为小型开放实例构建解决方案。CPro1从特定设计的文本定义和有效性验证器出发,引导LLMs选择并实施策略,同时提供自动化超参数调优和执行反馈。采用推理LLMs的CPro1成功解决了选自2006年《组合设计手册》的16个组合设计问题中7个长期未决的开放实例,其中包括3个非推理LLMs版CPro1未能解决的新实例(Bhaskar Rao设计、对称称重矩阵、平衡三元设计)。该方法还解决了近期(2025年)文献中多个问题的开放实例,生成了新的覆盖序列、Johnson团覆盖、删除码以及一个均匀嵌套Steiner四重系统。
GenIC: An LLM-Based Framework for Instance Completion in Knowledge Graphs
Abstract
arXiv:2505.24036v1
Knowledge graph completion aims to address the gaps of knowledge bases by adding new triples that represent facts. The complexity of this task depends on how many parts of a triple are already known. Instance completion involves predicting the relation-tail pair when only the head is given (h, ?, ?). Notably, modern knowledge bases often contain entity descriptions and types, which can provide valuable context for inferring missing facts. By leveraging these textual descriptions and the ability of large language models to extract facts from them and recognize patterns within the knowledge graph schema, we propose an LLM-powered, end-to-end instance completion approach. Specifically, we introduce GenIC: a two-step Generative Instance Completion framework. The first step focuses on property prediction, treated as a multi-label classification task. The second step is link prediction, framed as a generative sequence-to-sequence task. Experimental results on three datasets show that our method outperforms existing baselines. Our code is available at https://github.com/amal-gader/genic.
摘要
知识图谱补全旨在通过添加表示事实的新三元组来填补知识库的空白。该任务的复杂程度取决于三元组中已知部分的数量。实例补全任务要求在仅给定头实体时预测关系-尾实体对(h, ?, ?)。值得注意的是,现代知识库通常包含实体描述和类型,这些信息可为推断缺失事实提供有价值的上下文。通过利用这些文本描述以及大语言模型从中提取事实并识别知识图谱模式规律的能力,我们提出了一种基于LLM的端到端实例补全方法。具体而言,我们提出了GenIC:一个两阶段的生成式实例补全框架。第一阶段将属性预测视为多标签分类任务,第二阶段则将链接预测构建为生成式序列到序列任务。在三个数据集上的实验结果表明,我们的方法优于现有基线。代码已开源:https://github.com/amal-gader/genic。
Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding
Abstract
arXiv:2505.23990v1
To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
摘要
在人类社会中有效发挥作用的关键能力,在于适应变化环境、过滤信息并做出明智决策。随着机器人和智能体日益融入人类生活,将人类的认知负担转移至这些系统——尤其是在动态且信息丰富的场景中——正形成重要机遇与需求。为满足这一关键需求,我们提出Multi-RAG:一种多模态检索增强生成系统,旨在信息密集型场景中为人类提供自适应辅助。该系统通过整合并推理视频、音频和文本等多源信息流,以提升情境理解能力并降低认知负荷。作为实现长期人机协作的基础步骤,Multi-RAG探索了多模态信息理解如何成为动态人本场景中自适应机器人辅助的基石。为评估其在现实人类辅助代理任务中的能力,我们在MMBench-Video数据集(一个具有挑战性的多模态视频理解基准)上对Multi-RAG进行了测试。相较于现有开源视频大语言模型(Video-LLMs)和大视觉语言模型(LVLMs),我们的系统在消耗更少资源和输入数据的情况下实现了更优性能。结果表明,Multi-RAG具备作为动态现实场景中未来人机自适应辅助系统的实用高效基础架构潜力。
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Abstract
arXiv:2505.23885v1
Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: during inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; for training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate that Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
摘要
基于大语言模型(LLM)的多智能体系统在自动化现实任务方面展现出潜力,但由于其领域特定性,难以实现跨领域迁移。现有方法存在两个关键缺陷:应用于新领域时需要完全重构架构并重新训练所有组件。我们提出Workforce——一种分层多智能体框架,通过模块化架构实现战略规划与专业执行的解耦,该架构包含:(i)用于任务分解的领域无关规划器;(ii)子任务管理协调器;(iii)具备领域特定工具调用能力的专业化工作器。这种解耦设计在推理和训练阶段均支持跨领域迁移:推理时通过增减或修改工作器即可无缝适配新领域;训练阶段我们提出优化工作器学习(OWL),通过基于现实反馈的强化学习优化领域无关规划器,提升跨领域泛化能力。在GAIA基准测试中,Workforce覆盖多种现实跨领域智能体任务,实验结果表明其以69.70%的准确率取得开源领域最优性能,较OpenAI深度研究等商业系统高出2.34%。值得注意的是,经OWL训练的320亿参数模型达到52.73%准确率(提升16.37%),在挑战性任务上表现媲美GPT-4o。本研究通过实现可扩展的泛化能力和模块化领域迁移,为下一代通用人工智能助手奠定了基础。
An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring
Abstract
arXiv:2505.24239v1
While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent, which is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
摘要
尽管多智能体大语言模型系统在多个领域展现出强大能力,但其极易受到对抗性智能体和低性能智能体的影响。为解决这一问题,本文提出一种基于可信度评分的通用抗对抗多智能体大语言模型框架。我们将协作式问答过程建模为迭代博弈,其中智能体通过通信共同生成最终系统输出。该系统通过可信度评分来聚合团队输出,该评分根据各智能体在历史问答中的贡献度逐步学习获得。我们在多种任务和场景下的实验表明,即使对抗性智能体占多数,本系统仍能有效减轻对抗性影响,增强多智能体协作的鲁棒性。
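The idea of weighting each agent's output by a learned credibility score can be sketched with a weighted vote and a simple update rule. This is an illustration only. The abstract does not specify the aggregation or the learning rule, so `aggregate`, `update_credibility`, the learning rate, and the use of a known correct answer as the feedback signal are all assumptions.

```python
from collections import defaultdict


def aggregate(answers, credibility):
    """Return the answer backed by the largest total credibility.

    answers: dict mapping agent name -> proposed answer.
    credibility: dict mapping agent name -> score in [0, 1].
    """
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += credibility[agent]
    return max(totals, key=totals.get)


def update_credibility(credibility, answers, correct, lr=0.5):
    """Move each score toward 1 if the agent was right, toward 0 if wrong."""
    updated = {}
    for agent, score in credibility.items():
        target = 1.0 if answers[agent] == correct else 0.0
        updated[agent] = score + lr * (target - score)
    return updated


if __name__ == "__main__":
    scores = {"a": 0.5, "b": 0.5, "c": 0.5}
    round_answers = {"a": "4", "b": "5", "c": "5"}  # b and c are adversarial
    print(aggregate(round_answers, scores))
    scores = update_credibility(scores, round_answers, correct="4")
    print(aggregate(round_answers, scores))
```

In this toy run the two adversarial agents win the first vote while all scores are equal, but after a single round of feedback the lone honest agent outweighs them, which mirrors the adversary-majority setting the abstract refers to.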
InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
Abstract
arXiv:2505.23950v1
As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, which is not limited to language but also involves multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io.
摘要
随着多模态大模型(MLLMs)在各类挑战性任务中不断取得进展,一个关键问题随之浮现:当前模型仍缺失哪些核心能力?人类学习的关键特征在于与环境的持续交互——这种交互不仅限于语言,还涉及多模态理解与生成。为实现更接近人类水平的智能,模型必须同样支持多轮次、多模态的交互。尤其需要具备对交织多模态上下文的理解能力,并在持续对话中作出连贯响应。本研究通过InterMT数据集展开初步探索——这是首个基于真实人类反馈构建的多轮多模态交互偏好数据集。在此过程中,我们特别强调人类监督的重要性,通过引入专家标注来指导流程,这源于当前MLLMs确实缺乏此类复杂交互能力的客观事实。InterMT将人类偏好从全局和局部两个层面细分为九个子维度,包含15.6k条提示、52.6k个多轮对话实例及32.4k组人工标注的偏好对。为弥补多模态理解与生成能力的不足,我们提出一种代理工作流,利用工具增强的MLLMs来构建多轮问答实例。为进一步推进目标,我们推出InterMT-Bench评估框架,用于衡量MLLMs在辅助裁判完成多轮多模态任务时的表现。通过裁判模型调节等应用场景,我们验证了InterMT的实用价值,并揭示了裁判模型的多轮扩展规律。我们希望开源数据能促进学界对现有MLLMs进行更深入的对齐研究。项目网站详见https://pku-intermt.github.io。
ProofNet++: A Neuro-Symbolic System for Formal Proof Verification with Self-Correction
Abstract
arXiv:2505.24230v1
We propose ProofNet++, a neuro-symbolic framework that enhances automated theorem proving by combining large language models (LLMs) with formal proof verification and self-correction mechanisms. Current LLM-based systems suffer from hallucinated logical steps and unverifiable reasoning. ProofNet++ mitigates these limitations by integrating symbolic proof tree supervision, a reinforcement learning loop using verifiers as reward functions, and an iterative self-correction module. Our experiments on miniF2F, Lean's mathlib, and HOL Light show that ProofNet++ significantly improves proof accuracy, correctness, and formal verifiability over prior models. We provide theoretical analysis of the convergence and stability of the verifier-guided RL framework and release our datasets and codebase for future research.
摘要
我们提出ProofNet++,一种神经符号框架,通过将大型语言模型(LLMs)与形式化证明验证及自我修正机制相结合,增强了自动定理证明能力。当前基于LLM的系统存在逻辑步骤虚构和推理不可验证的问题。ProofNet++通过整合符号化证明树监督、采用验证器作为奖励函数的强化学习循环以及迭代式自我修正模块,有效缓解了这些局限性。在miniF2F、Lean的mathlib和HOL Light上的实验表明,ProofNet++较先前模型显著提升了证明准确性、正确性及形式化可验证性。我们对验证器引导的强化学习框架的收敛性和稳定性进行了理论分析,并公开了数据集与代码库以供后续研究。
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Abstract
arXiv:2505.24208v1
Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has recently been hypothesized to contribute to the safety degradation of LVLMs. However, whether and how the size of the modality gap affects LVLMs' safety has not been studied. In this work, we show that the size of the modality gap is highly inversely correlated with VLMs' safety. Then, we show that this modality gap is introduced during the pretraining of LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.
摘要
确保视觉-语言模型(VLM)生成安全输出对其可靠部署至关重要。然而,与纯文本大语言模型(LLM)相比,多模态大语言模型(LVLM)存在显著的安全性退化问题。即便是空白或不相关图像,也可能触发LVLM对原本在纯文本环境下会拒绝的提示生成有害回应。近期研究假设图像与文本表征之间的模态差异是导致LVLM安全性退化的原因之一,但模态差异程度如何影响LVLM安全性尚未得到研究。本工作首次揭示模态差异程度与VLM安全性呈高度负相关,并证明该差异产生于LVLM预训练阶段且能持续影响微调过程。基于此发现,我们提出一种预训练阶段的模态差异正则化方法。在LLaVA v1.5、ShareGPT4V和MiniGPT-4上的大量实验表明:该方法在不影响性能的前提下,最高可降低16.3%的不安全响应率;当与现有防御机制结合时,能进一步提升18.2%的安全防护效果。
E^2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness
Abstract
arXiv:2505.24226v1
Graph-based RAG methods like GraphRAG have shown promising global understanding of the knowledge base by constructing hierarchical entity graphs. However, they often suffer from inefficiency and rely on manually pre-defined query modes, limiting practical use. In this paper, we propose E^2GraphRAG, a streamlined graph-based RAG framework that improves both Efficiency and Effectiveness. During the indexing stage, E^2GraphRAG constructs a summary tree with large language models and an entity graph with SpaCy based on document chunks. We then construct bidirectional indexes between entities and chunks to capture their many-to-many relationships, enabling fast lookup during both local and global retrieval. For the retrieval stage, we design an adaptive retrieval strategy that leverages the graph structure to retrieve and select between local and global modes. Experiments show that E^2GraphRAG achieves up to 10 times faster indexing than GraphRAG and a 100 times speedup over LightRAG in retrieval while maintaining competitive QA performance.
摘要
基于图的检索增强生成方法(如GraphRAG)通过构建层次化实体图谱,展现出对知识库全局理解的良好潜力。然而,这类方法通常存在效率低下且依赖人工预定义查询模式的问题,限制了实际应用。本文提出E^2GraphRAG——一个高效且有效的流线型图式RAG框架。在索引阶段,E^2GraphRAG利用大语言模型构建摘要树,并基于文档块通过SpaCy生成实体图。随后建立实体与文档块间的双向索引以捕获其多对多关系,从而实现局部与全局检索时的快速查找。在检索阶段,我们设计了自适应检索策略,通过图结构动态选择局部或全局检索模式。实验表明,E^2GraphRAG的索引速度较GraphRAG提升高达10倍,检索速度较LightRAG快100倍,同时保持具有竞争力的问答性能。
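The bidirectional indexes between entities and chunks mentioned in the E^2GraphRAG abstract amount to a pair of lookup tables over a many-to-many relation. The sketch below assumes entity extraction has already happened (the paper uses SpaCy for that step) and takes the per-chunk entity lists as plain input. The function names are hypothetical and do not come from the paper's code.

```python
from collections import defaultdict


def build_indexes(chunk_entities):
    """Build entity -> chunks and chunk -> entities lookup tables.

    chunk_entities: dict mapping chunk id -> iterable of entity names
    found in that chunk (extraction is assumed to be done elsewhere).
    """
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = {}
    for chunk_id, entities in chunk_entities.items():
        chunk_to_entities[chunk_id] = set(entities)
        for entity in entities:
            entity_to_chunks[entity].add(chunk_id)
    return dict(entity_to_chunks), chunk_to_entities


def chunks_mentioning_all(entity_to_chunks, query_entities):
    """Return ids of chunks that mention every entity in the query."""
    sets = [entity_to_chunks.get(e, set()) for e in query_entities]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    corpus = {
        0: ["Alice", "Paris"],
        1: ["Alice", "Bob"],
        2: ["Bob", "Paris"],
    }
    e2c, c2e = build_indexes(corpus)
    print(chunks_mentioning_all(e2c, ["Alice", "Bob"]))
```

Both directions are dictionary lookups, so finding the chunks for an entity, or the entities for a chunk, needs no scan over the corpus. This is one plausible reading of the fast lookup the abstract attributes to the indexes.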
SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems
Abstract
arXiv:2505.24201v1
The rise of large language model (LLM)-based multi-agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi-faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input-output level, and fall short in addressing systemic or multi-point failures in MAS. In this work, we present a system-level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single-point faults and prompt injections but also multi-agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft's Magentic-One system, demonstrating its ability to detect covert risks and provide explainable root-cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent-based AI ecosystems.
摘要
基于大语言模型(LLM)的多智能体系统(MAS)的兴起带来了新的安全性与可靠性挑战。尽管这类系统在分解和协调复杂任务方面展现出巨大潜力,但其仍面临提示词操纵、工具不安全使用以及智能体协同失调等多维度风险。现有防护机制仅能提供输入输出层面的局部保护,难以应对MAS中的系统性或多点故障。本研究提出一个面向MAS的系统级异常检测框架,将结构建模与运行时行为监控相结合。该框架包含两个核心组件:首先,我们设计基于图的建模方法,将智能体交互表示为动态执行图,实现节点、边和路径层面的语义异常检测;其次,我们引入可插拔的哨兵智能体(SentinelAgent),这个由LLM驱动的监督智能体能基于安全策略和上下文推理,对MAS执行过程进行观测、分析和干预。通过将抽象检测逻辑与可执行措施相衔接,本方法不仅能检测单点故障和提示词注入攻击,还能识别多智能体共谋和潜在攻击路径。我们通过电子邮件助手和微软Magentic-One系统两个案例验证了该框架,证明其可有效检测隐蔽风险并提供可解释的根因溯源。本研究为构建更可信、可监测且安全的智能体AI生态系统奠定了基础。
Learning API Functionality from Demonstrations for Tool-based Agents
Abstract
arXiv:2505.24197v1
Digital tool-based agents that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent, hindering the development of reliable, general-purpose agents. In this work, we propose learning API functionality directly from demonstrations as a new paradigm applicable in scenarios without documentation. Using existing API benchmarks, we collect demonstrations from both expert API-based agents and from self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 5 models show that learning functionality from demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent's task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.
摘要
基于数字工具、通过调用外部应用程序接口(API)的智能体通常依赖文档来理解API功能。然而此类文档往往存在缺失、过时、私有化或不一致等问题,这阻碍了开发可靠通用型智能体的进程。本研究提出直接从演示中学习API功能的新范式,适用于无文档支持的场景。利用现有API基准测试集,我们分别从专家API智能体和自主探索中收集演示数据。为明确演示必须传递何种信息才能成功完成任务,我们深入研究了演示数量、大语言模型生成的摘要与评估对API智能体任务成功率的影响。通过在3个数据集和5个模型上的实验表明,即使对于最先进的大语言模型,从演示中学习功能仍是一项非平凡挑战。研究发现,提供显式函数调用和自然语言评述能显著提升智能体任务成功率,这主要归因于参数填充准确性的提高。我们分析了故障模式,识别错误来源,并重点指出了未来无文档、自改进API智能体研究面临的关键开放性问题。
SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought
Abstract
arXiv:2505.24181v1
Chain of Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain of Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage, deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model's original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.
摘要
思维链(CoT)提示通过鼓励逐步思考来提升大语言模型(LLMs)的推理性能。然而,基于CoT的方法依赖于中间推理步骤,这限制了其可扩展性和泛化能力。近期研究探索了递归推理方法,使LLMs在迭代中复用内部层以优化潜在表征,而无需显式的CoT监督。尽管前景可观,这些方法通常需要昂贵的预训练,且缺乏关于推理应如何跨迭代演进的原则性框架。为此,我们提出流思维链(Flow CoT),这是一种将递归推理建模为潜在认知状态渐进轨迹的推理范式。Flow CoT将每次迭代视为深化推理的独立认知阶段,无需依赖人工监督。为实现这一目标,我们提出SCOUT(基于教师的分步认知优化)——一个轻量级微调框架,可在无需预训练的情况下实现Flow CoT式推理。SCOUT采用渐进式蒸馏使每次迭代与适当容量的教师模型对齐,并通过基于交叉注意力的回顾模块整合先前迭代的输出,同时保留模型原始计算流。在八个推理基准上的实验表明,SCOUT持续提升了准确性和解释质量,在微调条件下最高获得1.8%的性能提升。定性分析进一步揭示,SCOUT能实现跨迭代的渐进深度推理,优化信念形成和解释粒度。这些结果不仅验证了SCOUT的有效性,也证明了Flow CoT作为增强LLMs推理能力的可扩展框架具有实际可行性。
FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation
Abstract
arXiv:2505.24258v1 Announce Type: new Abstract: Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Although large language models (LLMs) are fluent in natural and programming languages and are increasingly applied to decisions involving procedural tasks, their ability to perform data-flow reasoning has not been systematically evaluated. We introduce FABLE, an extensible benchmark designed to assess LLMs' understanding of data flow using structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark includes 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but its inference is more than 20 times slower than that of the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.
摘要
理解数据如何移动、转换和持久化(即数据流)是进行程序性任务推理的基础。尽管大语言模型(LLMs)在自然语言和编程语言方面表现出色,并越来越多地应用于程序性任务决策,但其数据流推理能力尚未得到系统评估。我们提出了FABLE——一个可扩展的基准测试,旨在利用结构化程序文本来评估LLMs对数据流的理解能力。FABLE适配了软件工程中的八种经典数据流分析:到达定义、非常繁忙表达式、可用表达式、活跃变量分析、区间分析、类型状态分析、污点分析以及并发分析。这些分析实例化在三个现实领域:烹饪食谱、旅行路线和自动化计划。该基准包含2,400个问答对,每个领域-分析组合有100个示例。我们评估了三类LLMs:专注推理的模型(DeepSeek-R1 8B)、通用模型(LLaMA 3.1 8B)和代码专用模型(Granite Code 8B)。每个模型通过每个提示五次采样补全的多数投票进行测试。结果表明,推理模型准确率更高,但推理速度比其他模型慢20倍以上;而通用模型和代码专用模型的表现接近随机猜测。FABLE提供了首个系统性评估数据流推理的诊断基准,并为开发具有更强程序理解能力的模型提供了见解。
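The evaluation protocol described above (majority voting over five sampled completions) and the benchmark size can be illustrated with a minimal sketch. The function name, the tie-breaking rule, and the sample answers are assumptions made for illustration; the abstract does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer.

    Ties are broken in favour of the answer that appears first
    (an assumption; the abstract does not state a tie-breaking rule).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

# Benchmark size as stated in the abstract:
# 8 analyses x 3 domains x 100 examples per combination.
ANALYSES = 8
DOMAINS = 3
EXAMPLES_PER_COMBINATION = 100
total_pairs = ANALYSES * DOMAINS * EXAMPLES_PER_COMBINATION

# Five hypothetical sampled completions for one prompt.
samples = ["yes", "no", "yes", "yes", "no"]
voted = majority_vote(samples)
```

With an odd number of samples and a binary answer space a strict majority always exists, so the tie rule only matters for multi-valued answers.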
GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments
Abstract
arXiv:2505.24306v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated their potential in planning and reasoning tasks, offering a flexible alternative to classical pathfinding algorithms. However, most existing studies focus on LLMs' independent reasoning capabilities and overlook the potential synergy between LLMs and traditional algorithms. To fill this gap, we propose GridRoute, a comprehensive evaluation benchmark that assesses how LLMs can take advantage of traditional algorithms. We also propose a novel hybrid prompting technique called Algorithm of Thought (AoT), which introduces guidance from traditional algorithms into prompting. Our benchmark evaluates six LLMs ranging from 7B to 72B parameters, assessing their correctness, optimality, and efficiency in grid environments of varying sizes. Our results show that AoT significantly boosts performance across all model sizes, particularly in larger or more complex environments, suggesting a promising approach to addressing path planning challenges. Our code is open-sourced at https://github.com/LinChance/GridRoute.
摘要
大语言模型(LLMs)的最新进展展示了其在规划与推理任务中的潜力,为传统路径搜索算法提供了灵活的替代方案。然而,现有研究大多关注LLMs的独立推理能力,忽视了LLMs与传统算法间的协同潜力。为填补这一空白,我们提出综合性评估基准GridRoute,用以评估LLMs如何利用传统算法优势。同时,我们提出一种新型混合提示技术"思维算法"(AoT),将传统算法指导引入提示过程。该基准测试了参数量从70亿到720亿不等的六种LLM在不同地图尺寸下的表现,评估其在各尺寸网格环境中正确性、最优性和效率方面的性能。结果表明,AoT能显著提升所有规模模型的性能,尤其在更大或更复杂的环境中,为解决路径规划挑战提供了可行方案。代码已开源:https://github.com/LinChance/GridRoute。
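To make the correctness and optimality criteria concrete, the sketch below checks a route made of cardinal moves against a breadth-first-search baseline. It is not the GridRoute code or the AoT prompt; the grid encoding ('#' for obstacles), the N/S/E/W move alphabet, and all function names are assumptions chosen for illustration.

```python
from collections import deque

# Cardinal moves as (row delta, column delta); row 0 is the top of the map.
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def shortest_path_length(grid, start, goal):
    """Breadth-first search over free cells; returns step count or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def check_route(grid, start, goal, route):
    """Return (is_correct, is_optimal) for a route such as 'SSEEENN'."""
    r, c = start
    for step in route:
        if step not in MOVES:
            return (False, False)
        dr, dc = MOVES[step]
        r, c = r + dr, c + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] == "#":
            return (False, False)
    correct = (r, c) == goal
    best = shortest_path_length(grid, start, goal)
    return (correct, correct and len(route) == best)

# A hypothetical 3x4 map with a wall in the third column.
GRID = [
    "..#.",
    "..#.",
    "....",
]
result = check_route(GRID, (0, 0), (0, 3), "SSEEENN")
```

A classical search of this kind gives the reference path length against which a model-produced route can be scored.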
Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules
Abstract
arXiv:2505.24292v1 Announce Type: new Abstract: Human-AI conversation frequently relies on quoting earlier text ("check it with the formula I just highlighted"), yet today's large language models (LLMs) lack an explicit mechanism for locating and exploiting such spans. We formalise the challenge as span-conditioned generation, decomposing each turn into the dialogue history, a set of token-offset quotation spans, and an intent utterance. Building on this abstraction, we introduce a quotation-centric data pipeline that automatically synthesises task-specific dialogues, verifies answer correctness through multi-stage consistency checks, and yields both a heterogeneous training corpus and the first benchmark covering five representative scenarios. To meet the benchmark's zero-overhead and parameter-efficiency requirements, we propose QuAda, a lightweight training-based method that attaches two bottleneck projections to every attention head, dynamically amplifying or suppressing attention to quoted spans at inference time while leaving the prompt unchanged and updating fewer than 2.8% of backbone weights. Experiments across models show that QuAda is suitable for all scenarios and generalises to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.
摘要
人机对话经常需要引用先前文本——“用我刚高亮的公式核对”——然而当前大型语言模型(LLMs)缺乏定位和利用此类文本段的显式机制。我们将该挑战形式化为跨度条件生成问题,将每个对话轮次分解为对话历史、一组词符偏移的引用跨度以及意图话语。基于此抽象框架,我们提出以引用为中心的数据处理流程:自动生成任务特定对话,通过多阶段一致性检查验证答案正确性,最终产出异构训练语料库和首个涵盖五种典型场景的基准测试集。为满足该基准的零开销与参数高效要求,我们提出QuAda——一种轻量级训练方法,该方法在每个注意力头附加双重瓶颈投影,在推理时动态增强或抑制对引用跨度的注意力,同时保持提示不变且仅更新<2.8%的主干权重。跨模型实验表明,QuAda适用于所有场景并能泛化至未见主题,为引用感知对话提供了即插即用的有效解决方案。
RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation
Abstract
arXiv:2505.24442v1 Announce Type: new Abstract: Although multi-agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and robustness issues. Inspired by ResNet's residual learning, we propose Residual Mixture-of-Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we design an embedding-based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross-layer incremental information by capturing inter-layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state-of-the-art performance on benchmarks across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at https://github.com/mindhunter01/RMoA.
摘要
尽管基于大语言模型的多智能体系统在多项任务中展现出强大能力,但其仍受限于高计算开销、信息丢失和鲁棒性问题。受ResNet残差学习的启发,我们提出残差智能体混合架构(RMoA),通过集成残差连接来优化效率与可靠性。为在最大化模型响应信息利用率的同时最小化计算成本,我们创新性地设计了一种基于嵌入向量的多样性选择机制,通过向量相似度贪婪地筛选响应。此外,为缓解迭代过程中的信息退化问题,我们引入残差提取智能体来捕获层间响应差异以保留跨层增量信息,并结合残差聚合智能体实现层次化信息整合。我们还提出自适应终止机制,根据残差收敛情况动态停止处理,进一步提升推理效率。RMoA在指令对齐、数学推理、代码生成和多任务理解等基准测试中均达到最先进性能,同时显著降低了计算开销。代码已开源:https://github.com/mindhunter01/RMoA。
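One way to picture an embedding-based greedy diversity selection is the sketch below. It is an assumption-laden illustration, not the RMoA implementation: the abstract does not give the selection criterion, so this version starts from the first response and repeatedly adds the candidate whose highest cosine similarity to the already chosen set is lowest. The function names and toy embeddings are hypothetical, and vectors are assumed to be nonzero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_diverse_subset(embeddings, k):
    """Greedily pick up to k indices that are mutually dissimilar.

    Starts with index 0, then adds the candidate whose maximum
    similarity to the chosen set is smallest (a max-min heuristic).
    """
    if not embeddings or k <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(embeddings)):
        best_idx, best_score = None, None
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            score = max(cosine(embeddings[i], embeddings[j]) for j in chosen)
            if best_score is None or score < best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen

# Toy 2-D "response embeddings": the second is a near-duplicate of the first.
toy = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
picked = greedy_diverse_subset(toy, 3)
```

In this toy case the near-duplicate response is the one left out, which is the intended effect of selecting for diversity before aggregation.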
Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
Abstract
arXiv:2505.24622v1 Announce Type: new Abstract: Predicting startup success requires models that are both accurate and interpretable. We present a lightweight ensemble framework that combines YES/NO questions generated by large language models (LLMs), forming a transparent decision-making system. Each question acts as a weak heuristic, and by filtering, ranking, and aggregating them through a threshold-based voting mechanism, we construct a strong ensemble predictor. On a test set where 10% of startups are classified as successful, our approach achieves a precision rate of 50%, representing a 5x improvement over random selection, while remaining fully transparent. When we incorporate expert-guided heuristics into the generation process, performance improves further to 54% precision. These results highlight the value of combining LLM reasoning with human insight and demonstrate that simple, interpretable ensembles can support high-stakes decisions in domains such as venture capital (VC).
摘要
预测初创企业成功需要兼具准确性与可解释性的模型。我们提出一种轻量级集成框架,通过整合大型语言模型(LLMs)生成的二元问题,构建透明决策系统。每个问题作为弱启发式规则,经过基于阈值的投票机制进行筛选、排序和聚合后,形成强集成预测器。在成功初创企业占比10%的测试集中,该方法达到50%的精确率,较随机选择提升5倍,同时保持完全透明性。当引入专家指导的启发式规则至生成过程时,精确率进一步提升至54%。这些结果凸显了LLM推理与人类洞察相结合的价值,证明简单可解释的集成模型能够支持风险投资等高风险领域的决策。
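The threshold-based voting and the reported lift over random selection can be sketched as follows. The function names, the toy answers, and the threshold value are hypothetical; only the 10% base rate and 50% precision come from the abstract.

```python
def ensemble_predict(answers, threshold):
    """Predict success when at least `threshold` questions answer YES.

    `answers` is a list of booleans (True = YES), one per question.
    """
    return sum(answers) >= threshold

def precision(predictions, labels):
    """Fraction of predicted successes that are actual successes."""
    true_positives = sum(p and y for p, y in zip(predictions, labels))
    predicted_positives = sum(predictions)
    return true_positives / predicted_positives if predicted_positives else 0.0

# Hypothetical answers to three YES/NO questions for four startups,
# paired with a made-up ground-truth label.
startups = [
    ([True, True, False], True),
    ([True, False, False], False),
    ([True, True, True], False),
    ([False, False, False], False),
]
predictions = [ensemble_predict(answers, threshold=2) for answers, _ in startups]
labels = [label for _, label in startups]
toy_precision = precision(predictions, labels)

# Lift reported in the abstract: 50% precision against a 10% base rate.
reported_lift = 50 / 10
```

Because each question is a readable YES/NO rule, the vote count behind any prediction can be inspected directly, which is the transparency the abstract emphasises.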
SEAR: A Multimodal Dataset for Analyzing AR-LLM-Driven Social Engineering Behaviors
Abstract
arXiv:2505.24458v1 Announce Type: new Abstract: The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR's alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at https://github.com/INSLabCN/SEAR-Dataset.
摘要
SEAR数据集是一种新型多模态资源,旨在研究通过增强现实(AR)和多模态大语言模型(LLM)实施的社会工程(SE)攻击这一新兴威胁。该数据集收录了60名参与者在模拟对抗场景(包括会议、课堂和社交活动)中的180段标注对话,包含同步采集的AR视觉/音频线索(如面部表情、语调)、环境上下文、精选社交媒体资料,以及信任评级和易感性评估等主观指标。关键发现表明SEAR在诱导服从(93.3%的钓鱼链接点击率、85%的电话接听率)和劫持信任(76.7%的交互后信任激增)方面具有惊人效力。本数据集支持检测AR驱动的社会工程攻击、设计防御框架及理解多模态对抗操纵等研究。通过数据匿名化和机构审查委员会合规等严格伦理保障措施确保其负责任使用。SEAR数据集发布于https://github.com/INSLabCN/SEAR-Dataset。
Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning
Abstract
arXiv:2505.24478v1 Announce Type: new Abstract: Integrating Large Language Models (LLMs) with Knowledge Graphs (KGs) results in complex systems with numerous hyperparameters that directly affect performance. While such systems are increasingly common in retrieval-augmented generation, the role of systematic hyperparameter optimization remains underexplored. In this paper, we study this problem in the context of Cognee, a modular framework for end-to-end KG construction and retrieval. Using three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, and MuSiQue), we optimize parameters related to chunking, graph construction, retrieval, and prompting. Each configuration is scored using established metrics (exact match, F1, and DeepEval's LLM-based correctness metric). Our results demonstrate that meaningful gains can be achieved through targeted tuning. The gains are consistent but not uniform, with performance varying across datasets and metrics. This variability highlights both the value of tuning and the limitations of standard evaluation measures. While demonstrating the immediate potential of hyperparameter tuning, we argue that future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems.
摘要
将大语言模型(LLMs)与知识图谱(KGs)集成会形成具有众多直接影响性能的超参数的复杂系统。尽管此类系统在检索增强生成中日益普遍,但系统性超参数优化的作用仍未得到充分探索。本文以Cognee(一个端到端知识图谱构建与检索的模块化框架)为背景研究该问题。通过使用三个多跳问答基准数据集(HotPotQA、TwoWikiMultiHop和MuSiQue),我们优化了与文本分块、图谱构建、检索及提示工程相关的参数。每种配置均采用现有指标(精确匹配、F1值及DeepEval基于LLM的正确性指标)进行评分。结果表明,通过针对性调优可获得显著性能提升。虽然增益具有一致性,但不同数据集和指标间存在性能差异,这种差异性既凸显了调优的价值,也揭示了标准评估方法的局限性。在证明超参数调优即时潜力的同时,我们认为未来进展不仅取决于架构创新,更需建立针对复杂模块化系统的优化与评估的清晰框架。
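Two of the metrics named above, exact match and F1, are commonly computed at the token level for QA. The sketch below shows that common formulation under simplifying assumptions: normalization is only lowercasing and whitespace splitting (standard QA scripts usually also strip punctuation and articles), and it is not taken from Cognee or DeepEval. The function names and example strings are hypothetical.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()

def exact_match(prediction, gold):
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the Eiffel Tower", "Eiffel Tower")
f1 = token_f1("the Eiffel Tower", "Eiffel Tower")
```

The example shows how the two metrics can disagree on the same answer (no exact match, but high token overlap), one reason gains from tuning may vary by metric.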
How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
Abstract
arXiv:2505.24273v1 Announce Type: new Abstract: Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how much it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used in SFT as a warm-up make a moderate contribution to RL training compared with cold-start RL; however, this contribution diminishes as tasks become more difficult. Motivated by this observation, we construct synthetic datasets that vary systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of either correctness (content) or structure (i.e., backtrack frequency). We find that (1) longer CoT with backtracks generally induces better and more stable RL training, and (2) more challenging problems with larger search spaces tend to need more backtracks during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
摘要
大型语言模型(LLM)近期取得的突破性进展,通过监督微调(SFT)和强化学习(RL)等技术,有效提升了其在可验证答案的数学与逻辑问题上的推理能力。已有研究表明,RL能有效内化搜索策略,实现长链思维(CoT)推理,而回溯能力会自然习得。然而回溯的具体优势——尤其是其对推理改进的实际贡献度及最佳使用程度——仍缺乏深入理解。本研究系统探究了SFT与RL在八项推理任务(倒计时、数独、Arc一维问题、几何、立方体颜色旋转、列表函数、斑马谜题和自我参照)中的动态关系。研究发现:与冷启动RL相比,SFT阶段采用的短CoT序列确实对RL训练有适度贡献,但随着任务难度增加,这种贡献逐渐减弱。基于此发现,我们构建了回溯步数系统变化的合成数据集,通过控制实验分离了内容正确性与结构特征(即回溯频率)的影响。实验表明:(1)含回溯的长CoT通常能带来更好更稳定的RL训练效果;(2)搜索空间较大的复杂问题往往需要在SFT阶段使用更高频次回溯。此外,通过蒸馏数据的实验证明,RL训练基本不受长CoT序列内容正确性的影响,表明RL更关注结构模式而非内容准确性。这些发现为设计最优训练策略以有效扩展LLM的推理能力提供了实践指导。
Mixture-of-Experts for Personalized and Semantic-Aware Next Location Prediction
Abstract
arXiv:2505.24597v1 Announce Type: new Abstract: Next location prediction plays a critical role in understanding human mobility patterns. However, existing approaches face two core limitations: (1) they fall short in capturing the complex, multi-functional semantics of real-world locations; and (2) they lack the capacity to model heterogeneous behavioral dynamics across diverse user groups. To tackle these challenges, we introduce NextLocMoE, a novel framework built upon large language models (LLMs) and structured around a dual-level Mixture-of-Experts (MoE) design. Our architecture comprises two specialized modules: a Location Semantics MoE that operates at the embedding level to encode rich functional semantics of locations, and a Personalized MoE embedded within the Transformer backbone to dynamically adapt to individual user mobility patterns. In addition, we incorporate a history-aware routing mechanism that leverages long-term trajectory data to enhance expert selection and ensure prediction stability. Empirical evaluations across several real-world urban datasets show that NextLocMoE achieves superior performance in terms of predictive accuracy, cross-domain generalization, and interpretability.
摘要
下一位置预测在理解人类移动模式中具有关键作用。然而现有方法存在两个核心局限:(1) 难以捕捉现实场景中地点复杂的多功能语义;(2) 缺乏对不同用户群体异构行为动态的建模能力。为解决这些问题,我们提出NextLocMoE框架,该框架基于大语言模型(LLM)构建,采用双层级混合专家(MoE)设计。架构包含两个专用模块:在嵌入层运作的"位置语义MoE"用于编码地点的丰富功能语义,以及嵌入Transformer主干网的"个性化MoE"动态适配个体移动模式。此外,我们引入历史感知路由机制,利用长期轨迹数据优化专家选择并确保预测稳定性。在多个真实城市数据集上的实证评估表明,NextLocMoE在预测精度、跨域泛化性和可解释性方面均表现出优越性能。
Leveraging Knowledge Graphs and LLMs for Structured Generation of Misinformation
Abstract
arXiv:2505.24479v1 Announce Type: new Abstract: The rapid spread of misinformation, further amplified by recent advances in generative AI, poses significant threats to society, impacting public opinion, democratic stability, and national security. Understanding and proactively assessing these threats requires exploring methodologies that enable structured and scalable misinformation generation. In this paper, we propose a novel approach that leverages knowledge graphs (KGs) as structured semantic resources to systematically generate fake triplets. By analyzing the structural properties of KGs, such as the distance between entities and their predicates, we identify plausibly false relationships. These triplets are then used to guide large language models (LLMs) in generating misinformation statements with varying degrees of credibility. By utilizing structured semantic relationships, our deterministic approach produces misinformation inherently challenging for humans to detect, drawing exclusively upon publicly available KGs (e.g., WikiGraphs). Additionally, we investigate the effectiveness of LLMs in distinguishing between genuine and artificially generated misinformation. Our analysis highlights significant limitations in current LLM-based detection methods, underscoring the necessity for enhanced detection strategies and a deeper exploration of inherent biases in generative models.
摘要
错误信息的迅速传播在生成式人工智能最新进展的推波助澜下,对社会构成重大威胁,影响公众舆论、民主稳定和国家安全。要理解并主动评估这些威胁,需要探索能够实现结构化、可扩展错误信息生成的方法论。本文提出一种创新方法,利用知识图谱(KGs)作为结构化语义资源来系统生成虚假三元组。通过分析知识图谱的结构特性(如实体间距离及其谓词关系),我们识别出具有潜在虚假性的关联关系。这些三元组随后用于指导大语言模型(LLMs)生成具有不同可信度的错误信息陈述。我们的确定性方法通过利用结构化语义关系,仅基于公开知识图谱(如WikiGraphs)即可生成人类难以识别的固有性错误信息。此外,我们探究了大语言模型在区分真实信息与人工生成错误信息方面的有效性。分析结果表明,当前基于LLM的检测方法存在显著局限性,这凸显了加强检测策略以及深入探索生成模型固有偏见的必要性。
MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge
Abstract
arXiv:2505.24493v1 Announce Type: new Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, the results of our subjective experiments demonstrate a consistent performance improvement on SER.
摘要
尽管语音情感识别(SER)在深度学习推动下取得显著进展,但标注工作仍是主要障碍。人工标注不仅成本高昂,且存在不一致性问题——标注者往往具有不同偏好并可能缺乏必要的情境知识,这会导致标签存在差异且不准确。与此同时,大型语言模型(LLMs)已成为文本数据标注的可扩展替代方案。然而,LLMs在无需人工监督情况下完成语音情感数据标注的潜力尚未得到充分研究。为解决这些问题,我们应用GPT-4o对情景剧《老友记》收集的多模态数据集进行标注,仅使用文本线索作为输入。通过设计结构化文本提示,我们的方法充分利用了GPT-4o在训练过程中积累的知识,证明其无需接触多模态输入即可生成准确且符合情境的标注。据此我们提出MELT——完全由GPT-4o标注的多模态情感数据集。通过微调四个自监督学习(SSL)骨干网络并跨情感数据集评估语音情感识别性能,我们验证了MELT的有效性。此外,主观实验结果表明该方法能持续提升SER性能。